ABSTRACT

The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena),
Europe's primary nucleotide sequence resource, captures and
presents globally
comprehensive nucleic acid sequence and asso- 
ciated information. Covering the spectrum from 
raw data to assembled and functionally annotated 
genomes, the ENA has witnessed a dramatic growth 
resulting from advances in sequencing technology 
and ever broadening application of the method- 
ology. During 2011, we have continued to operate 
and extend the broad range of ENA services. In par- 
ticular, we have released major new functionality in 
our interactive web submission system, Webin, 
through developments in template-based submis- 
sions for annotated sequences and support for 
raw next-generation sequence read submissions. 

INTRODUCTION 

The European Nucleotide Archive (ENA) is maintained 
and developed at the European Molecular Biology 
Laboratory's European Bioinformatics Institute (EMBL- 
EBI) and serves as Europe's primary repository for nu- 
cleotide sequence and associated information. Content 
spans raw sequence reads from all sequencing platforms, 
read alignments, assembly information and submitted 
functional annotation. Providing both the permanent scientific
record as a complement to the literature publication
process and a forum for early sharing of pre-publication
data, the ENA serves as a critical foundation for the
global bioinformatics data infrastructure. Globally comprehensive
coverage is assured through long-standing data
exchange agreements with the DNA Data Bank of Japan
(DDBJ) (1) and the United States National Institutes of 
Health National Center for Biotechnology Information 
(NCBI) (2) under the International Nucleotide Sequence 
Database Collaboration (3; http://www.insdc.org/). 

Underlying ENA are a number of core databases, 
including the Sequence Read Archive for raw reads and 
read alignments from next generation sequencing plat- 
forms (4) and EMBL-Bank for high level assembly infor- 
mation, assembled sequences and functional annotation. 
ENA services are numerous: we provide submission tools, 
both the web-based Webin system and programmatic interfaces;
we offer search technologies, such as the newly developed
rapid ENA sequence similarity search
(http://www.ebi.ac.uk/ena/search) and text-based search tools
(http://www.ebi.ac.uk/ena); we present integrated access
to all ENA content through the ENA Browser, which
offers both web browsing and REST access
(http://www.ebi.ac.uk/ena/about/browser). We are highly responsive
in the development of new technologies and services
to adapt to changes in sequencing technology and
user requirements: we are leading a community-facing
sequence read compression initiative, CRAM
(5; http://www.ebi.ac.uk/ena/about/cram_toolkit); we are developing
an encrypted BAM read alignment server that supports
reference coordinate-based lookups of controlled
access reads by region; we are active in the development
of data warehousing methodologies to provide real-time
access to the massive data sets that we store (e.g. the
ENA Taxon Portal;
http://www.ebi.ac.uk/ena/data/view/Taxon:Eukaryota).

In this article, we comment on content and report 
briefly on means by which ENA data can be accessed. 
We then focus on major developments in our Webin sub- 
mission system in the areas of template-based submissions 
of annotated and assembled sequences and raw next
generation sequence read submission. We also announce 
the introduction of a sequence length limit for submission 
of assembled sequences.

Figure 1. (A) Growth of assembled sequences (ENA: EMBL-Bank); see
http://www.ebi.ac.uk/ena/about/statistics#embl_growth for dynamically
updated growth chart. (B) Growth of raw data from next generation
sequencing platforms (ENA: SRA); see
http://www.ebi.ac.uk/ena/about/statistics#sra_growth for dynamically
updated growth chart.



ENA CONTENT 

At the time of going to press, ENA contains
346 598 699 035 nucleotides of assembled sequence in 220 504 007
assembled sequence entries (see EMBL-Bank release notes at
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html)
and more than 100 terabases of raw next generation sequence reads
(Figure 1A and B).

Notable datasets submitted to ENA during 2011 include
assemblies of Gorilla gorilla (FR853080-FR853106),
Atlantic cod, Gadus morhua (Project:41391), vine, Vitis
vinifera (Project:18785), Takifugu rubripes (Project:1434),
Macaca fascicularis (FR874244-FR874264) and medieval
mitochondria and Yersinia plasmids (6; HE576978-HE576987);
raw genomic reads from 18 lines of
Arabidopsis thaliana (7; ERP000565), Staphylococcus
aureus (8; ERP000528) and Mus musculus ES cells (9;
ERP000570); and transcriptomic reads from multiple
Silene species (10; ERP000371).



ENA DATA ACCESS 

Full ENA content is made available through an integrated
platform, the ENA Browser, that supports discovery
(text search, sequence similarity search, taxon
lookup, etc.) and retrieval of records, both interactively
(through web browsing) and programmatically (under
RESTful URLs). Full details are available from
http://www.ebi.ac.uk/ena/about/browser. Records are made
available in a selection of appropriate formats that include
EMBL-Bank flat file, fasta and XML for assembled and
annotated sequences, Fastq for sequence reads and
Darwin Core for taxon records
(http://www.ebi.ac.uk/ena/about/formats). In addition, we support both ftp
and Aspera protocols for network transfers of large raw
data sets (ftp://ftp.sra.ebi.ac.uk) and offer a variety of
data products over ftp for other areas of ENA content
(ftp://ftp.ebi.ac.uk/pub/databases/embl and
ftp://ftp.ebi.ac.uk/pub/databases/ena).
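
As an illustration of programmatic retrieval under RESTful URLs, the following minimal sketch composes a Browser address for a single record. Only the path prefix is taken from the Taxon Portal URL cited in this article; the helper name build_view_url, the "display" parameter and its values are assumptions made for the example, and the Browser documentation remains the authority on the actual URL syntax.

```python
from typing import Optional
from urllib.parse import quote

# Path prefix as it appears in the Taxon Portal URL cited in this article.
ENA_VIEW_BASE = "http://www.ebi.ac.uk/ena/data/view/"

# ASSUMPTION: the parameter name "display" and these values are illustrative
# only; check the ENA Browser documentation for the real options.
ASSUMED_DISPLAY_VALUES = {"text", "fasta", "xml", "fastq"}


def build_view_url(record_id: str, display: Optional[str] = None) -> str:
    """Compose a URL for one record (hypothetical helper, no network access)."""
    # Keep ':' and '-' unescaped so identifiers and accession ranges survive.
    url = ENA_VIEW_BASE + quote(record_id, safe=":-")
    if display is not None:
        if display not in ASSUMED_DISPLAY_VALUES:
            raise ValueError(f"unsupported display value: {display!r}")
        url += "&display=" + display
    return url


if __name__ == "__main__":
    # FR853080 is the first Gorilla gorilla accession listed under ENA CONTENT.
    print(build_view_url("FR853080", display="fasta"))
```

The function only builds a string, so it can be combined with any HTTP client once the real parameter names have been confirmed.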



Figure 2. Usage of the different web-based interactive submission systems for annotated sequences at ENA between 2009 and 2011 (traditional versus templated submission systems).

ANNOTATED AND ASSEMBLED
SEQUENCE SUBMISSIONS

A pre-tailored template system was introduced in our
Webin submission framework in 2009 for annotated sequence
submissions and has been expanded during 2011
with the release of nine new templates. These templates
have been designed for the most frequent types of sequence
submissions and reached 15 in number in September 2011.
When using the templates, submitters provide nucleotide
sequences with associated annotation through spreadsheets
or Fastq files with pre-defined mandatory and optional
fields, a process that significantly reduces the overall
complexity of the submissions process for both the submitter
and the ENA curator. Some advantages of the new
system include the ability to choose from a small number 
of variables, functionalities that prevent the need for re- 
petitive entry of information constant across all records in 
a data set and straightforward validation before data sub- 
mission. The template concept has shown growing popu- 
larity since its launch versus the traditional system (which 
remains available for a limited time). Under the tradition- 
al system, submitters were able to annotate their entries 
with the full INSDC-approved features and qualifiers 
either one entry at a time or by defining with an ENA 
curator a specific template for each submission. This was 
useful for annotating small submissions in great detail but 
did not cater efficiently for larger-scale submissions of 
same-type data. Figure 2 shows the usage of the available 
submission systems between 2009 and 2011 and Table 1 
shows the currently available templates. 
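
The template idea described above (pre-defined mandatory and optional fields, values entered once for a whole data set, and validation before submission) can be sketched in a few lines. This is not the Webin implementation: the field names, the TEMPLATE structure and the validate_rows helper are hypothetical and serve only to make the mechanism concrete.

```python
import csv
import io

# HYPOTHETICAL template definition; the field names are illustrative and are
# not the fields of any actual Webin template.
TEMPLATE = {
    "mandatory": ["ENTRYNUMBER", "ORGANISM", "SEQUENCE"],
    "optional": ["ISOLATE", "COUNTRY"],
}


def validate_rows(tsv_text, template, constants=None):
    """Check a tab-delimited spreadsheet against a template.

    `constants` holds values shared by every record, so the submitter
    enters them once instead of repeating them in each row.
    Returns (valid_records, error_messages).
    """
    constants = constants or {}
    allowed = set(template["mandatory"]) | set(template["optional"])
    records, errors = [], []

    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    unknown = set(reader.fieldnames or []) - allowed
    if unknown:
        errors.append(f"unknown columns: {sorted(unknown)}")

    # Data rows start on spreadsheet line 2 (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        filled = {k: v for k, v in row.items() if k and v}
        record = {**constants, **filled}
        missing = [f for f in template["mandatory"] if not record.get(f)]
        if missing:
            errors.append(f"row {line_no}: missing {missing}")
        else:
            records.append(record)
    return records, errors


if __name__ == "__main__":
    sheet = "ENTRYNUMBER\tSEQUENCE\tISOLATE\n1\tACGT\tA1\n2\t\tA2\n"
    ok, problems = validate_rows(sheet, TEMPLATE, {"ORGANISM": "Vitis vinifera"})
    print(len(ok), problems)
```

Reporting every problem with its row number, rather than stopping at the first, lets a submitter correct a spreadsheet in a single pass.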

As part of these developments, ENA is also facilitating 
the submission of marker gene sequences compliant with a 
community standard that has been developed by the 
Genomic Standards Consortium (GSC), called the 
Minimal Information about a MARKer gene Sequence 
Standard (MIMARKS) (11, 12). MIMARKS provides a 
minimal set of required information fields essential for 
downstream reuse of the data. The last two templates in 
Table 1 have been designed for submissions of 
MIMARKS-compliant data. 

Further improvements to the submissions system for 
annotated sequences will continue in 2012 and beyond. 



Table 1. Names and definitions of templates currently available for
sequence submissions to EMBL-Bank

Template name                  Definition
-----------------------------  ---------------------------------------
Intergenic Spacer, IGS         For intergenic spacer (IGS) sequences
                               between neighbouring genes (e.g.
                               psbA-trnH IGS, 16S-23S rRNA IGS).
                               Inclusion of the flanking genes is
                               allowed

ITS region                     For the 18S rRNA, ITS1, 5.8S rRNA,
                               ITS2, 28S rRNA region, where the
                               locations of the boundaries are not
                               known

D-Loop                         For mitochondrial D-loop (control
                               region) sequences. All D-loops are
                               considered partial

trnK-matK locus                For complete or partial matK gene
                               within the chloroplast trnK gene

COI gene                       For mitochondrial cytochrome oxidase
                               subunit 1 genes

MHC gene 1 exon                For partial MHC class I or II antigens
                               containing one exon

MHC gene 2 exons               For partial MHC class I or II antigens
                               containing two exons

Single CDS genomic DNA         For complete or partial single
                               non-segmented coding sequence
                               (CDS) derived from genomic DNA

Single viral CDS genomic RNA   For complete or partial single coding
                               sequence (CDS) derived from viral
                               genomic RNA. Please do not use for
                               viral DNA, peptides processed from
                               polyproteins, viral cRNAs, or proviral
                               sequences, as these are all annotated
                               differently

Single CDS mRNA                For complete or partial single coding
                               sequence (CDS) derived from mRNA
                               (via cDNA)

rRNA gene                      For ribosomal RNA genes from
                               prokaryotic, nuclear or mitochondrial
                               DNA. All rRNAs are considered
                               partial

EST                            For EST (expressed sequence tag)
                               submissions

WGS (unannotated)              For unannotated Whole Genome
                               Shotgun (WGS) sequences

MIMARKS-Survey 16S rRNA        For the submission of 16S rRNA
sequences                      sequence compliant with the
                               MIMARKS Minimal Information
                               about a MARKer gene Sequence
                               Standard

Soil sample MIMARKS-Survey     For the submission of 16S rRNA
using 16S rRNA sequences       sequence compliant with the
                               MIMARKS Minimal Information
                               about a MARKer gene Sequence
                               Standard, specific to soil
                               metagenomes



NEXT GENERATION SEQUENCE DATA
SUBMISSIONS

To complement the existing programmatic SRA REST
submission interface, we have recently extended the
Webin system to support submissions of raw next generation
sequencing reads to the SRA. Unlike the SRA REST
interface, which is targeted for large-scale sequence submitters
and allows direct programmatic interaction between
external LIMS systems and the SRA database at
EBI, this new component of Webin is designed for interactive
use. Users work through a web interface to create
studies, samples and experiments, to update submitted
metadata and to release previously submitted data to the
public. Importantly, all metadata are submitted either
by uploading or editing spreadsheets. While SRA REST
submitters are fully exposed to the underlying SRA
XML-data model, the SRA submission functionality in
Webin completely hides this complexity. For example, during
a raw sequence submission process, users are asked to
define their raw data file format and are then presented
with a spreadsheet, which can be either uploaded or filled
with the required additional information (Figure 3).
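
To make the "choose a file format, then fill in a spreadsheet" step concrete, the sketch below generates a blank tab-delimited run spreadsheet whose file columns depend on the chosen format. It is an assumption-laden illustration rather than a description of Webin: the format keys are modelled on the choices visible in Figure 3, while the column names (run_alias, file_name_N) and the helper blank_run_sheet are hypothetical.

```python
import csv
import io

# Number of data files one run needs for each format. The format list follows
# the choices visible in Figure 3; the keys themselves are made up here.
FILES_PER_RUN = {
    "bam": 1,
    "fastq_single": 1,
    "fastq_paired": 2,
    "csfasta_qual_single": 2,   # one csfasta/qual pair
    "csfasta_qual_paired": 4,   # two csfasta/qual pairs
}


def blank_run_sheet(file_format, n_runs=1):
    """Return a blank tab-delimited spreadsheet for `n_runs` runs.

    Column names are HYPOTHETICAL placeholders, not Webin's real headers.
    """
    if file_format not in FILES_PER_RUN:
        raise ValueError(f"unknown file format: {file_format!r}")
    n_files = FILES_PER_RUN[file_format]
    header = ["run_alias"] + [f"file_name_{i}" for i in range(1, n_files + 1)]

    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for _ in range(n_runs):
        writer.writerow([""] * len(header))  # empty cells for the submitter
    return buf.getvalue()


if __name__ == "__main__":
    print(blank_run_sheet("fastq_paired", n_runs=2))
```

Deriving the header from the file format is what allows a single spreadsheet workflow to serve single-file, paired and colour-space submissions alike.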

The SRA submission component of Webin is under
active development and new improvements are deployed
weekly. Forthcoming improvements include support for
European Genome-Phenome Archive submissions for
controlled access raw sequence data, support for checklists
for provision of community standard compliant metadata
and numerous usability additions.

Figure 3. Screenshot of raw data definition page in SRA Webin (beta). [The page asks the submitter to choose the file format to submit: BAM, SRF, SFF, one Fastq file (single), two Fastq files (paired), one pair of csfasta/qual files (single) or two pairs of csfasta/qual files (paired). Run details are then entered by spreadsheet or by editing a table and are validated before proceeding to the next step.]

INTRODUCTION OF SEQUENCE LENGTH LIMIT 
FOR ASSEMBLED SEQUENCES 

ENA will introduce a sequence length limit for submissions
of assembled sequences. From January 2012, ENA
will accept sequences <100 bp only if they fall into one of
the following sequence categories: 'Ancient DNA',
'non-coding-RNA', 'Microsatellites' or 'Complete Exons'.
Exceptions require the submitter to demonstrate that a
peer-reviewed journal has accepted a manuscript by the
submitter, confirming the relevance of the short sequences
to the scientific community. A validation step will be added
to Webin to facilitate implementation of this
requirement. We encourage submitters to check our
website for further announcements of forthcoming changes
(http://www.ebi.ac.uk/ena/about/forthcoming_changes).
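
The rule stated above can be written down as a small check. This is a sketch of the policy as described in the text, not the validation step that will run in Webin; the function name, its arguments and the way an accepted manuscript is represented (a simple flag) are assumptions.

```python
MIN_LENGTH_BP = 100

# Categories for which sequences shorter than the limit remain acceptable,
# as listed in the text.
EXEMPT_CATEGORIES = {
    "Ancient DNA",
    "non-coding-RNA",
    "Microsatellites",
    "Complete Exons",
}


def is_acceptable(sequence, category=None, accepted_manuscript=False):
    """Apply the length rule to one assembled sequence.

    `accepted_manuscript` is a HYPOTHETICAL flag standing in for the
    submitter's evidence that a peer-reviewed journal has accepted a
    manuscript covering the short sequences.
    """
    if len(sequence) >= MIN_LENGTH_BP:
        return True                      # the limit only affects <100 bp
    if category in EXEMPT_CATEGORIES:
        return True                      # short but in an exempt category
    return accepted_manuscript           # otherwise an exception is needed


if __name__ == "__main__":
    print(is_acceptable("ACGT" * 10))                          # 40 bp, no category
    print(is_acceptable("ACGT" * 10, category="Ancient DNA"))  # 40 bp, exempt
```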

HELPDESK AND TRAINING 

The ENA team provides advice and guidance regarding 
ENA services by email through datasubs@ebi.ac.uk. 
Feedback and suggestions related to all of our services 
are very welcome at the same email address. We also 
operate a variety of hands-on training programmes, for
which details are available at http://www.ebi.ac.uk/training.
We strongly encourage submitters to take our survey
(http://www.surveymonkey.com/s/ENA_User_Survey_2011)
and help us to improve our service.